[benchmarks] Reworked the conversion benchmark and added more tests for up/down casts #4800
Conversation
Force-pushed from b5381cf to 73136d9
y = tl.cast(y, y_type, bitcast=True)
tl.store(y_ptr + offsets, y, mask=mask)
...
kernel.__name__ = kernel.__qualname__ = name
Can you explain why we want to change the kernel name here?
It makes it much easier to find and compare the kernel IRs in the cache. For example, I want to compare the rtne and rtz IRs:
$ find ~/.triton/cache/ -name '*fp32_to_fp16*.llir'
/home/jovyan/.triton/cache/PXX2VZY5SPACCMUKHEIEDQEEB2CN2A562JHDMIVPLHTS6F6LSRAA/rtne_fp32_to_fp16_conversion_kernel.llir
/home/jovyan/.triton/cache/CYCEIIP4OQCRWTMZHZHLPXO63ZR5J7ZEPN4KCM72NVOITFO5FMRA/rtz_fp32_to_fp16_conversion_kernel.llir
I think we can keep this for the development stage only: from a production perspective there is no performance or visualization improvement from this change, but it makes the code structure inconsistent with the other benchmarks, which reduces readability. Without this change we also would not need lru_cache or _make_kernel and could keep using triton.jit as a decorator.
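For illustration, a minimal sketch of the plain @triton.jit decorator form (the target type is hard-coded to fp16 here; the actual kernel in this PR takes the source/target types and the rounding mode as parameters):

```python
import triton
import triton.language as tl


@triton.jit
def conversion_kernel(x_ptr, y_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.cast(x, tl.float16)  # target type hard-coded for illustration
    tl.store(y_ptr + offsets, y, mask=mask)
```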
It's very convenient to use benchmarks for development and testing. This benchmark produces 20 kernels, and without the name change it's not easy to recognize the required kernel in the cache.
tl.store(y_ptr + offsets, y, mask=mask)
...
kernel.__name__ = kernel.__qualname__ = name
return triton.jit(kernel)
I would suggest using triton.jit as a decorator, as the other benchmarks do.
The decorator does not allow changing the name.
if r == 'rtne' or s == tl.float32]
...
@lru_cache
Triton has its own cache mechanism. You can check the kernel JIT cache under TRITON_CACHE_DIR if it is specified. We do not need to use lru_cache.
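For example, the cache can be pointed at a scratch directory so the benchmark's artifacts are easy to find (a sketch; the path is arbitrary, and the variable has to be set before any kernel is compiled):

```python
import os

# Redirect Triton's compilation cache to a known location (arbitrary path);
# set it before any kernel is compiled so the artifacts end up here.
os.environ["TRITON_CACHE_DIR"] = "/tmp/triton-bench-cache"

import triton  # noqa: E402  -- imported after the environment variable is set
```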
Right, but a JITFunction object is created on each call to _make_kernel(). I'm not sure whether that has a significant impact on the benchmark results.
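To make the trade-off concrete, here is a minimal sketch of the memoized factory under discussion (the real kernel body, the type parameters, and the rounding mode are elided):

```python
from functools import lru_cache

import triton
import triton.language as tl


@lru_cache
def _make_kernel(name):
    # Build a plain Python function, give it a recognizable name so its
    # compiled artifacts are easy to spot in the Triton cache, then JIT it.
    def kernel(x_ptr, y_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        tl.store(y_ptr + offsets, x, mask=mask)

    kernel.__name__ = kernel.__qualname__ = name
    # Without lru_cache, a fresh JITFunction would be constructed on every call;
    # memoizing by name reuses the same JITFunction (and its compile cache) instead.
    return triton.jit(kernel)
```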
return gbps(ms), gbps(max_ms), gbps(min_ms)
...
def _report(plot_name, line_names, line_vals):
    report = triton.testing.perf_report(
        triton.testing.Benchmark(
It's recommended to use triton_kernels_benchmark instead of triton.testing.Benchmark to get an accurate kernel time cost. It's better to use perf_report and benchmark in Python decorator form to keep consistent with the other benchmarks. If you want to add more arguments to the benchmark, you can refer to
intel-xpu-backend-for-triton/benchmarks/triton_kernels_benchmark/gemm_benchmark.py
Lines 298 to 342 in 4f35147
```python
def get_benchmark(
        providers_filter: Optional[list[str]] = None,
        transpose_a=False,
        transpose_b=False,
        matmul_kernel=matmul_kernel_with_block_pointers,
        matmul_kernel_batched=matmul_kernel_with_block_pointers_batched,
        plot_name='matmul-performance',
):
    """
    Returns a Mark object containing a Benchmark object constructed at runtime and parameterized by the provided option values.
    The benchmark can then be executed by calling the :code:`.run` method on the return value.
    """
    supported_providers = {
        'triton': 'Triton',
        'onednn': 'OneDNN',
    }
    # use_cutlass
    if not (transpose_a or transpose_b):
        if torch.xpu.get_device_name() != 'Intel(R) Arc(TM) Graphics':
            # FIXME: enable cutlass on LNL
            supported_providers['cutlass'] = 'CUTLASS'
    providers = benchmark_suite.filter_providers(supported_providers, providers_filter)

    # Benchmark Performance
    # pylint: disable=too-many-branches
    @benchmark_suite.perf_report(
        benchmark_suite.Benchmark(
            # argument names to use as an x-axis for the plot
            x_names=['B', 'M', 'N', 'K'],
            # different possible values for `x_name`
            x_vals=X_VALS,
            line_arg='provider',
            # argument name whose value corresponds to a different line in the plot
            # possible values for `line_arg``
            line_vals=list(providers.keys()),
            # label name for the lines
            line_names=list(providers.values()),
            # line styles
            styles=[('green', '-'), ('green', '--'), ('blue', '-'), ('blue', '--')],
            ylabel=['GB/s', 'TFlops'],  # label name for the y-axis
            plot_name=plot_name,
            # name for the plot. Used also as a file name for saving the plot.
            args={},
        ))
    def benchmark(B, M, N, K, provider):
```
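Applied to this conversion benchmark, the same structure might look roughly like the following sketch (names and values are hypothetical, `import triton_kernels_benchmark as benchmark_suite` is assumed, and the timing call falls back to triton.testing.do_bench, whereas the suite's own measurement helper should be preferred):

```python
import torch
import triton
import triton_kernels_benchmark as benchmark_suite


@benchmark_suite.perf_report(
    benchmark_suite.Benchmark(
        x_names=['N'],  # number of elements on the x-axis
        x_vals=[2**i for i in range(12, 27, 2)],
        line_arg='provider',
        line_vals=['triton'],
        line_names=['Triton'],
        styles=[('green', '-')],
        ylabel=['GB/s'],
        plot_name='fp32-to-fp16-conversion',  # hypothetical plot/file name
        args={},
    ))
def benchmark(N, provider):
    x = torch.rand(N, device='xpu', dtype=torch.float32)

    # Placeholder workload: a plain torch cast; the PR's Triton conversion
    # kernels would be launched here instead.
    ms, min_ms, max_ms = triton.testing.do_bench(lambda: x.to(torch.float16),
                                                 quantiles=[0.5, 0.0, 1.0])

    def gbps(t_ms):
        # fp32 read (4 B) + fp16 write (2 B) per element; t_ms is in milliseconds.
        return x.numel() * (4 + 2) / t_ms * 1e-6

    return gbps(ms), gbps(max_ms), gbps(min_ms)
```

The resulting Mark object can then be executed via its .run method, e.g. `benchmark.run(show_plots=False, print_data=True)`.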
Switched to triton_kernels_benchmark, but now there are many more results. I've grouped them into 8 benchmarks, one for each pair of types:
fp8e4nv-fp16:
N fp8e4nv->fp16-GB/s fp16->fp8e4nv-rtne-GB/s fp8e4nv->fp16-GB/s-min fp16->fp8e4nv-rtne-GB/s-min fp8e4nv->fp16-GB/s-max fp16->fp8e4nv-rtne-GB/s-max fp8e4nv->fp16-CV fp16->fp8e4nv-rtne-CV
0 4096.0 0.009099 0.024323 0.007853 0.024311 0.010699 0.024335 0.119163 0.000359
1 16384.0 0.042830 0.097263 0.042824 0.097246 0.042845 0.097292 0.000268 0.000246
2 65536.0 0.171211 0.388892 0.171152 0.388799 0.171237 0.388984 0.000311 0.000254
3 262144.0 0.514351 1.515020 0.514218 1.512835 0.514573 1.517318 0.000333 0.001386
4 1048576.0 1.211834 4.166472 1.206658 4.166042 1.215926 4.167631 0.005078 0.000204
5 4194304.0 1.627807 7.824465 1.627375 7.666599 1.628487 7.990247 0.000360 0.021691
6 16777216.0 1.961954 10.152995 1.960651 10.151963 1.963331 10.153733 0.000892 0.000100
7 67108864.0 2.121951 10.910362 2.121518 10.909286 2.122361 10.911670 0.000233 0.000094
fp8e4nv-bf16:
N fp8e4nv->bf16-GB/s bf16->fp8e4nv-rtne-GB/s fp8e4nv->bf16-GB/s-min bf16->fp8e4nv-rtne-GB/s-min fp8e4nv->bf16-GB/s-max bf16->fp8e4nv-rtne-GB/s-max fp8e4nv->bf16-CV bf16->fp8e4nv-rtne-CV
0 4096.0 0.010431 0.020689 0.010428 0.020679 0.010432 0.020704 0.000189 0.000471
1 16384.0 0.041715 0.082722 0.041694 0.082707 0.041728 0.082747 0.000361 0.000286
2 65536.0 0.166801 0.330556 0.166707 0.330429 0.166843 0.330723 0.000373 0.000418
3 262144.0 0.508169 1.243508 0.507339 1.241306 0.508504 1.246121 0.001142 0.001853
4 1048576.0 1.170913 2.298198 1.170286 2.297332 1.171415 2.299105 0.000427 0.000372
5 4194304.0 1.590390 4.414360 1.588078 4.253789 1.592187 4.589115 0.001110 0.040270
6 16777216.0 1.927215 5.541919 1.925484 5.541688 1.928841 5.542215 0.000827 0.000049
7 67108864.0 2.078866 5.955846 2.078395 5.955534 2.079470 5.956143 0.000249 0.000041
fp8e4nv-fp32:
N fp8e4nv->fp32-GB/s fp32->fp8e4nv-rtne-GB/s fp32->fp8e4nv-rtz-GB/s fp8e4nv->fp32-GB/s-min fp32->fp8e4nv-rtne-GB/s-min fp32->fp8e4nv-rtz-GB/s-min fp8e4nv->fp32-GB/s-max fp32->fp8e4nv-rtne-GB/s-max fp32->fp8e4nv-rtz-GB/s-max fp8e4nv->fp32-CV fp32->fp8e4nv-rtne-CV fp32->fp8e4nv-rtz-CV
0 4096.0 0.010362 0.028596 0.220511 0.010355 0.028587 0.220215 0.010369 0.028603 0.220785 0.000572 0.000277 0.001602
1 16384.0 0.041439 0.114354 0.879678 0.041421 0.114350 0.878593 0.041457 0.114356 0.880860 0.000401 0.000099 0.001598
2 65536.0 0.165713 0.456808 3.507412 0.165682 0.456735 3.506849 0.165763 0.456913 3.508351 0.000286 0.000255 0.000757
3 262144.0 0.500332 1.804281 12.206938 0.499604 1.803710 12.181413 0.500887 1.805399 12.231430 0.001165 0.000460 0.001846
4 1048576.0 1.193028 6.527287 31.324152 1.192344 6.523916 31.237370 1.194213 6.531394 31.402012 0.001297 0.000556 0.002341
5 4194304.0 1.599378 12.358361 46.772281 1.598380 12.115543 46.425925 1.600313 12.566826 47.254437 0.000648 0.014158 0.025110
6 16777216.0 1.926104 18.816976 55.924057 1.925887 18.813263 55.880840 1.926340 18.820691 55.949418 0.000105 0.000181 0.000633
7 67108864.0 2.082396 18.749761 57.819820 2.082237 18.684158 57.809411 2.082518 18.789707 57.828937 0.000069 0.002619 0.000188
fp8e5-fp16:
N fp8e5->fp16-GB/s fp16->fp8e5-rtne-GB/s fp8e5->fp16-GB/s-min fp16->fp8e5-rtne-GB/s-min fp8e5->fp16-GB/s-max fp16->fp8e5-rtne-GB/s-max fp8e5->fp16-CV fp16->fp8e5-rtne-CV
0 4096.0 0.090101 0.073537 0.089762 0.073458 0.090459 0.073584 0.003488 0.001016
1 16384.0 0.360405 0.293673 0.359298 0.293326 0.360817 0.293915 0.002256 0.001314
2 65536.0 1.435304 1.172799 1.434174 1.171625 1.437193 1.173639 0.003717 0.001082
3 262144.0 5.674113 4.018765 5.645762 4.015686 5.688889 4.020613 0.004035 0.000635
4 1048576.0 12.052598 5.826393 12.042631 5.817665 12.069245 5.833200 0.001300 0.001061
5 4194304.0 20.661597 11.960147 19.493159 11.722220 21.315986 12.711861 0.058014 0.036636
6 16777216.0 29.556085 14.154166 29.527371 14.151587 29.590698 14.155981 0.000950 0.000153
7 67108864.0 31.602352 15.201539 31.593605 15.141837 31.612656 15.238494 0.000372 0.003269
fp8e5-bf16:
N fp8e5->bf16-GB/s bf16->fp8e5-rtne-GB/s fp8e5->bf16-GB/s-min bf16->fp8e5-rtne-GB/s-min fp8e5->bf16-GB/s-max bf16->fp8e5-rtne-GB/s-max fp8e5->bf16-CV bf16->fp8e5-rtne-CV
0 4096.0 0.010617 0.020971 0.010614 0.020965 0.010618 0.020975 0.000192 0.000310
1 16384.0 0.042465 0.083883 0.042451 0.083859 0.042476 0.083900 0.000380 0.000310
2 65536.0 0.169765 0.335034 0.169733 0.334914 0.169818 0.335188 0.000293 0.000341
3 262144.0 0.508051 1.238690 0.507874 1.218866 0.508189 1.260308 0.000259 0.016453
4 1048576.0 1.166588 2.303297 1.165706 2.303064 1.167035 2.303631 0.000671 0.000161
5 4194304.0 1.574403 4.553186 1.572210 4.525967 1.575782 4.597888 0.000999 0.027275
6 16777216.0 1.908402 5.553034 1.906225 5.552781 1.910218 5.553339 0.000954 0.000059
7 67108864.0 2.059672 5.925414 2.058876 5.922217 2.060320 5.930591 0.000308 0.002324
fp8e5-fp32:
N fp8e5->fp32-GB/s fp32->fp8e5-rtne-GB/s fp32->fp8e5-rtz-GB/s fp8e5->fp32-GB/s-min fp32->fp8e5-rtne-GB/s-min fp32->fp8e5-rtz-GB/s-min fp8e5->fp32-GB/s-max fp32->fp8e5-rtne-GB/s-max fp32->fp8e5-rtz-GB/s-max fp8e5->fp32-CV fp32->fp8e5-rtne-CV fp32->fp8e5-rtz-CV
0 4096.0 0.045230 0.051425 0.346238 0.045214 0.051406 0.345479 0.045246 0.051437 0.347354 0.000944 0.000355 0.003939
1 16384.0 0.180759 0.205635 1.384368 0.180600 0.205622 1.383784 0.180919 0.205643 1.384719 0.000944 0.000178 0.001195
2 65536.0 0.720809 0.821871 5.509542 0.718344 0.821583 5.497986 0.722399 0.822077 5.520216 0.002619 0.000379 0.002502
3 262144.0 2.875647 3.222620 19.814360 2.874386 3.213494 19.775497 2.879438 3.232515 19.859394 0.000812 0.002902 0.003130
4 1048576.0 8.786459 10.241500 58.596032 8.780279 10.234403 58.383962 8.799141 10.248006 58.776682 0.001510 0.000658 0.002911
5 4194304.0 15.368254 18.575305 89.136205 14.555469 18.125147 88.234263 16.153809 19.050470 89.944760 0.040169 0.027046 0.017450
6 16777216.0 21.927561 25.998892 110.101176 21.904659 25.994424 109.902104 21.952351 26.003124 110.295157 0.001060 0.000165 0.001898
7 67108864.0 23.479745 27.945724 118.175422 23.471203 27.936418 118.097838 23.491448 27.958579 118.261008 0.000419 0.001028 0.000936
fp16-fp32:
N fp16->fp32-GB/s fp32->fp16-rtne-GB/s fp32->fp16-rtz-GB/s fp16->fp32-GB/s-min fp32->fp16-rtne-GB/s-min fp32->fp16-rtz-GB/s-min fp16->fp32-GB/s-max fp32->fp16-rtne-GB/s-max fp32->fp16-rtz-GB/s-max fp16->fp32-CV fp32->fp16-rtne-CV fp32->fp16-rtz-CV
0 4096.0 0.151311 0.495884 0.640500 0.150943 0.492308 0.635236 0.151929 0.499512 0.644836 0.003135 0.011283 0.007796
1 16384.0 0.601910 1.978744 2.520615 0.600234 1.967339 2.497561 0.602707 1.988349 2.531520 0.002605 0.005164 0.006579
2 65536.0 2.398829 7.886402 10.105782 2.395322 7.876923 10.039216 2.402346 7.914976 10.176397 0.001565 0.002228 0.006945
3 262144.0 9.556836 30.840469 36.459528 9.547785 30.624299 36.368480 9.581287 30.942398 36.653244 0.001869 0.005031 0.004931
4 1048576.0 32.911989 66.788275 85.773093 32.572565 66.534007 84.836246 33.098991 67.216412 86.802651 0.007563 0.004307 0.009884
5 4194304.0 52.632751 111.535810 125.427760 51.940559 108.323965 124.356737 54.398006 116.095662 126.884793 0.020584 0.029834 0.009612
6 16777216.0 83.593503 183.247404 166.796403 83.332753 182.726495 165.182107 83.919648 183.590298 168.014093 0.002849 0.002184 0.008465
7 67108864.0 93.211930 206.396724 183.397635 93.128115 205.986829 183.197372 93.279308 206.789128 183.574224 0.000715 0.001967 0.001007
bf16-fp32:
N bf16->fp32-GB/s fp32->bf16-rtne-GB/s fp32->bf16-rtz-GB/s bf16->fp32-GB/s-min fp32->bf16-rtne-GB/s-min fp32->bf16-rtz-GB/s-min bf16->fp32-GB/s-max fp32->bf16-rtne-GB/s-max fp32->bf16-rtz-GB/s-max bf16->fp32-CV fp32->bf16-rtne-CV fp32->bf16-rtz-CV
0 4096.0 0.216262 0.496485 0.140877 0.215852 0.494686 0.140659 0.217133 0.497570 0.141047 0.003744 0.003428 0.001148
1 16384.0 0.858250 1.979939 0.562637 0.856904 1.976834 0.561866 0.860504 1.988349 0.563411 0.002169 0.003098 0.001038
2 65536.0 3.427615 7.867467 2.246307 3.413333 7.839234 2.243769 3.442017 7.884504 2.248079 0.003163 0.003395 0.001256
3 262144.0 13.554499 30.464150 8.862205 13.518151 30.340742 8.853823 13.596681 30.538676 8.868200 0.004282 0.004607 0.001022
4 1048576.0 27.018191 63.879128 18.462470 26.969546 63.689016 18.440067 27.092187 64.125246 18.476459 0.002321 0.004308 0.001328
5 4194304.0 41.347635 109.755436 29.842077 41.139987 103.226614 28.489266 41.689568 116.095650 30.753634 0.005780 0.056834 0.043698
6 16777216.0 66.565688 172.773959 46.072265 66.323594 172.434798 46.043690 66.741518 173.189538 46.121663 0.002899 0.002157 0.000737
7 67108864.0 73.298160 193.933828 49.872075 73.071499 193.687548 49.631154 73.524945 194.266189 50.059130 0.002932 0.001379 0.003740
Force-pushed from 0b17de9 to 4cc1041
Force-pushed from 4cc1041 to c1d4de6
The benchmarks failed with an OOM error, but I see the same issue on the main branch: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/17038905956/job/48297736887
It should be fixed by #4913.
Force-pushed from c1d4de6 to a1688f8
Closes #4799
Here is an example of the benchmark report:
And the benchmark run results - https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/16596562534/job/46944550498